library(readr)
library(igraph)
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.0 ✔ purrr 1.0.1
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.1.8
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ lubridate::%--%() masks igraph::%--%()
## ✖ dplyr::as_data_frame() masks tibble::as_data_frame(), igraph::as_data_frame()
## ✖ purrr::compose() masks igraph::compose()
## ✖ tidyr::crossing() masks igraph::crossing()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ purrr::simplify() masks igraph::simplify()
## ℹ Use the conflicted package (http://conflicted.r-lib.org/) to force all conflicts to become errors
library(visNetwork)
The data is about the characters in Victor Hugo’s novel Les Misérables and is taken from GitHub (https://github.com/MADStudioNU/lesmiserables-character-network/tree/master). It consists of two files, which we will name “nodes” and “edges”.
nodes <- read_csv("jean-complete-node.csv")
head(nodes)
## # A tibble: 6 × 3
## Id Label Description
## <chr> <chr> <chr>
## 1 TH Thénardier Thénardier, innkeeper in Montfermeil, aka Jondrette
## 2 TG Théodule Lieutenant Théodule Gillenormand, grandnephew of Gill…
## 3 TE German teamster German teamster
## 4 TC Three concierges Three concierges, met by Gvaroche
## 5 GV Government troops Government troops
## 6 GU Gueulemer Gueulemer, member of Patron-Minette
edges <- read_csv("jean-complete-edge.csv")
head(edges)
## # A tibble: 6 × 5
## Source Target Type Id Label
## <chr> <chr> <chr> <dbl> <chr>
## 1 MY NP Undirected 1 1.1.1
## 2 MB MY Undirected 2 1.1.1
## 3 ME MY Undirected 3 1.1.1
## 4 MB ME Undirected 4 1.1.1
## 5 HD MY Undirected 5 1.1.2
## 6 MB MY Undirected 6 1.1.2
The first table shows the structure of the dataset nodes, which describes each character in the book; each character corresponds to a node. The dataframe is composed as follows:

- Id: the unique abbreviation of the character’s name
- Label: the complete name of the character
- Description: a brief description of the character in the book

The second table shows the structure of the dataset edges, the core of our analysis, in which all the interactions between characters in the book are encoded. The structure of the dataframe is the following:

- Source: the node (character, identified by its Id) from which the interaction starts
- Target: the node (character) at which the interaction ends
- Type: the kind of interaction; in our case they are all undirected
- Id: a sequential numerical code to keep track of all the interactions
- Label: the chapter.paragraph.subparagraph in which the interaction takes place
For simplicity, we rename the columns “Source” and “Target” of the edges dataset to “from” and “to” respectively, and convert the column names of the nodes dataset to lowercase.
nodes <- nodes %>%
rename("id" = "Id",
"label" = "Label")
edges <- edges %>%
rename("from" = "Source",
"to" = "Target")
Before starting the analysis we need to check for NAs and remove them if present.
table(is.na(edges))
##
## FALSE TRUE
## 7944 1
which(is.na(edges))
## [1] 2686
# edges has 1589 rows and 5 columns, so flat position 2686 is 1589 + 1097,
# i.e. row 1097 of the second column ("to")
edges[1097,]
## # A tibble: 1 × 5
## from to Type Id Label
## <chr> <chr> <chr> <dbl> <chr>
## 1 BO <NA> Undirected 1097 4.12.2
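As a side note, which() with arr.ind = TRUE reports the row and column of each missing cell directly; a minimal sketch (to be run before the NAs are dropped):
# row/column location of every NA in the edge table (sketch)
which(is.na(edges), arr.ind = TRUE)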
edges <- na.omit(edges)
table(is.na(edges))
##
## FALSE
## 7940
We now check whether there are duplicate interactions between characters. To do so, we group the existing edges dataframe by the columns “from” and “to” and count the number of occurrences of each interaction.
multiple_edges <- edges %>%
group_by(from, to) %>%
summarize(n = n()) %>%
group_by(from, to) %>%
filter(n > 1)
head(multiple_edges)
## # A tibble: 6 × 3
## # Groups: from, to [6]
## from to n
## <chr> <chr> <int>
## 1 AM GP 3
## 2 AZ CO 5
## 3 AZ EP 11
## 4 AZ JV 3
## 5 AZ TH 10
## 6 AZ TM 6
We now merge the dataframes “edges” and “multiple_edges” in order to attach the interaction count to each edge, keeping only one row per distinct pair of characters.
merged_edges <- edges %>%
left_join(multiple_edges, by = c("from", "to")) %>%
mutate(n = ifelse(is.na(n), 1, n)) %>%
distinct(from, to, .keep_all = TRUE)
head(merged_edges)
## # A tibble: 6 × 6
## from to Type Id Label n
## <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 MY NP Undirected 1 1.1.1 1
## 2 MB MY Undirected 2 1.1.1 11
## 3 ME MY Undirected 3 1.1.1 9
## 4 MB ME Undirected 4 1.1.1 7
## 5 HD MY Undirected 5 1.1.2 1
## 6 MG MY Undirected 7 1.1.2 1
The above dataframe is the one we’ll use for our analysis. For visualization purposes we first convert the edges dataframe (which still contains duplicate interactions) into a graph. Then, we convert the merged_edges dataframe into a weighted graph, in which the weight of each edge is the number of interactions between the two characters, stored in the column n.
# graph with duplicate interactions
g <- graph_from_data_frame(edges, directed = FALSE)
# graph without duplicate interactions, but weighted
g_w <- graph_from_data_frame(merged_edges, directed=FALSE)
# Set the weights of the edges based on the 'n' column
E(g_w)$weight <- merged_edges$n
hist(E(g_w)$weight, breaks = 25,
xlab= "Weights", main="Edges' Weight Distribution", col = 'blue', border = 'white')
table(E(g_w)$weight)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 18 20 21 22 24 32
## 232 79 48 27 32 21 11 6 4 6 5 1 2 4 1 2 1 1 1 1
## 34 37 62
## 1 1 1
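As a cross-check, igraph can also collapse the parallel edges itself and sum the counts into weights; a minimal sketch (it should closely reproduce the weight table above, up to pairs that appear in both orientations, which simplify() also merges). Note that purrr::simplify() masks igraph::simplify(), hence the igraph:: prefix.
# equivalent construction (sketch): give every parallel edge weight 1,
# then collapse duplicates, summing their weights and dropping the other attributes
g_alt <- graph_from_data_frame(edges, directed = FALSE)
E(g_alt)$weight <- 1
g_alt <- igraph::simplify(g_alt, edge.attr.comb = list(weight = "sum", "ignore"))
table(E(g_alt)$weight)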
The following is the plot of the multi-edge, unweighted graph:
vis_g <- toVisNetworkData(g)
visNetwork(
nodes = vis_g$nodes,
edges = vis_g$edges,
width = "100%",
height = '600px'
)
The following is the simplified, but weighted graph:
vis_g_w <- toVisNetworkData(g_w)
visNetwork(
nodes = vis_g_w$nodes,
edges = vis_g_w$edges,
width = "100%",
height = '600px'
)
We now proceed with our analysis, looking at degree distribution, betweenness, strength and transitivity.

## Degree
table(degree(g_w))
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 25 29 33 35 87
## 76 20 16 8 5 2 15 6 4 1 3 2 2 3 4 2 1 1 2 1 2 1 1 1 1
hist(degree(g_w), breaks = 50, ylim = c(0,110),
xlab= "Degree",main = "Histogram of degree")
The code degree(g_w) calculates the degree of each vertex in the graph; the degree of a vertex is the number of edges connected to it. From the above output we see that the highest degree is 87, while most of the other vertices have degrees between 1 and 7.
t <- transitivity(g_w)
t
## [1] 0.2894354
In general, transitivity measures the likelihood that two vertices connected to a common neighbour are also connected to each other. It is calculated as the ratio of the number of closed triplets (three vertices that are all connected to one another) to the total number of connected triplets, open or closed. In this case, the transitivity value of 0.2894354 suggests a moderate level of clustering in the graph.
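To make the definition concrete, the same number can be recomputed from triangle and triple counts; a small sketch:
# check of the definition (sketch): closed triplets over all connected triplets
deg <- degree(g_w)
closed  <- sum(count_triangles(g_w))   # each triangle is counted once per corner, i.e. 3 times
triples <- sum(choose(deg, 2))         # connected triples centred at each vertex
closed / triples                       # should reproduce transitivity(g_w)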
s <- graph.strength(g_w)
hist(s, breaks = 30, ylim = c(0,130), main= "Histogram of Strength",
xlab= "Strength")
table(s)
## s
## 1 2 3 4 5 6 7 8 9 11 12 13 15 16 18 20 22 23 27 32
## 70 16 11 10 6 4 3 1 1 1 4 4 1 3 3 2 1 2 3 5
## 33 35 36 37 39 42 49 51 53 55 57 58 78 81 85 88 89 95 123 124
## 1 3 2 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1
## 146 157 197 312
## 1 1 1 1
The code graph.strength(g_w) calculates the strength of each vertex in the graph; the strength of a vertex is the sum of the weights of the edges connected to it. This measure is particularly useful when working with weighted graphs, which is the case here: recall that the weight of each edge is the number of interactions between the two characters over the whole book.
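For a single vertex, the strength is literally the sum of its incident edge weights; a quick sketch, assuming “JV” is the id used for Jean Valjean in the node table:
# strength of one vertex = sum of the weights of its incident edges (sketch)
inc_jv <- incident(g_w, "JV")          # edges attached to vertex "JV"
sum(inc_jv$weight)
graph.strength(g_w, vids = "JV")       # same value from igraph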
b <- betweenness(g_w)
hist(b, breaks = 40, main= "Histogram of Betweenness", xlim=c(0, 10000),
xlab= "Betweenness", col=hcl.colors(length(b), rev = F, palette= 'blues'), border="black")
The code betweenness(g_w) calculates the betweenness centrality of each vertex in the graph; betweenness centrality measures how central or influential a vertex is, based on shortest paths. Vertices with higher betweenness are considered more central in the network, as they lie on a larger number of shortest paths between other pairs of vertices.
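One caveat worth noting: when the graph carries a weight attribute, igraph’s betweenness() uses the weights as distances (costs) along shortest paths, so pairs with many interactions count as “farther apart”. A purely topological betweenness can be obtained by ignoring the weights; a one-line sketch:
# betweenness ignoring the interaction-count weights (sketch)
b_unw <- betweenness(g_w, weights = NA)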
plot(s, b, ylab="Betweenness", xlab= "Strength", main= "Betweenness vs Strength", pch=16, col=hcl.colors(length(b), rev = F, palette= 'blues'))
The above scatter plot shows the relationship between vertex strength and betweenness centrality in the graph. Each point represents a vertex, and its position corresponds to that vertex’s strength and betweenness values. The plot lets us check whether vertices with higher strength also tend to have higher betweenness centrality, i.e. whether the most strongly connected vertices are also the most influential ones. In this case we see one observation with both very high strength and very high betweenness, suggesting that it corresponds to an important character. We therefore investigate who the most important actors in our network are.
ia <- order(b, decreasing=T)[1]
name <- names(which.max(b))
name_out <- nodes[nodes$id == name, ][2, 2]
namename <- name_out$label
cat('The most important actor in the network is',toupper(namename),'\n')
## The most important actor in the network is JEAN VALJEAN
We now look at the degree distribution:
degree_dist <- function(graph) {
fd <- table(degree(graph))
d <- as.numeric(names(fd)) + 1
list(d = d, fd = fd)
}
dd <- degree_dist(g_w)
with(dd, plot(log(d), log(fd), main="Degree Distribution", xlab="Degree", ylab="Frequency of each Degree", pch=16, col=hcl.colors(length(b), rev = F, palette= 'blues')))
The degree_dist() function computes the degree distribution of the graph, i.e. the frequency of each degree value (number of edges incident to each vertex), and returns a list with two elements: the degree values and their corresponding frequency counts. We then plot the degree distribution on logarithmic scales to check whether we are dealing with a heterogeneous or a homogeneous degree distribution. From the scatterplot we can state that our network follows a heterogeneous degree distribution: the degrees of the nodes are not evenly distributed, with some nodes having a significantly higher number of connections (high degree) while others have relatively few (low degree), as already shown by the degree table in the first part of the analysis. This suggests looking for a power-law distribution, which models this kind of behaviour.
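As a complementary check of the power-law hypothesis, igraph provides a maximum-likelihood fit of the degree sequence; a minimal sketch:
# ML fit of a power law to the degree sequence (sketch)
pl_fit <- fit_power_law(degree(g_w))
pl_fit$alpha   # estimated exponent
pl_fit$KS.p    # Kolmogorov-Smirnov p-value; small values argue against the power law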
We now start modelling the network. To do so we use three models: a linear model, a generalised linear model (Poisson), and an exponential random graph model (ERGM).
mod0 <- lm(log(fd) ~ log(d), data=dd)
cat('model with the fd transformed:', '\n')
## model with the fd transformed:
summary(mod0)
##
## Call:
## lm(formula = log(fd) ~ log(d), data = dd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.26840 -0.36630 0.00667 0.28918 1.15824
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.0667 0.4010 10.140 5.87e-10 ***
## log(d) -1.1670 0.1499 -7.785 6.82e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6182 on 23 degrees of freedom
## Multiple R-squared: 0.7249, Adjusted R-squared: 0.7129
## F-statistic: 60.61 on 1 and 23 DF, p-value: 6.824e-08
From the above result of the linear model, the estimated intercept is 4.0667, which represents the expected log-transformed frequency when the log-transformed degree is zero. The estimated coefficient for the log-degree is -1.1670, which indicates the expected change in the log-transformed frequency for a one-unit increase in the log-transformed degree. Moreover, the Multiple R-squared of 0.7249 indicates that 72.49% of the variance in the response variable is explained by the model, and the extremely low p-value of the overall F-test (6.824e-08) provides strong evidence against the null hypothesis, indicating that the relationship is significant.
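On the original count scale, this log-log fit corresponds to a power law of the form fd ≈ exp(4.07) · d^(-1.17); a quick sketch comparing the back-transformed fitted values with the observed frequencies:
# back-transform the fitted log-frequencies to counts and compare (sketch)
cbind(observed = as.integer(dd$fd), predicted = round(exp(fitted(mod0))))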
mod1 <- glm(fd ~ log(d), family = poisson, data=dd)
cat('model without the fd transformed:', '\n')
## model without the fd transformed:
summary(mod1)
##
## Call:
## glm(formula = fd ~ log(d), family = poisson, data = dd)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.38608 -0.48423 -0.03714 0.57033 3.08011
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.2376 0.1639 31.96 <2e-16 ***
## log(d) -1.6570 0.1020 -16.25 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 360.262 on 24 degrees of freedom
## Residual deviance: 39.314 on 23 degrees of freedom
## AIC: 118.78
##
## Number of Fisher Scoring iterations: 5
The above model analyses the relationship between the frequency of each degree and the log-degree in the network. From the results, the estimated intercept is 5.2376 and the estimated coefficient for log(d) is -1.6570; this last value represents the average effect on the log of the expected frequency of a one-unit change in the logarithm of the degree. Both coefficients have extremely small p-values (<2e-16), indicating a highly significant relationship between predictor and response. In addition, the drop from the null to the residual deviance shows that the model explains a substantial portion of the variation in the data, and the relatively low AIC suggests a good balance between model complexity and fit.
with(dd, plot(log(d),log(fd),main="Degree Distribution", xlab="Degree", ylab="Frequency of each Degree", pch=16, col=hcl.colors(length(b), rev = F, palette= 'blues')))
abline(a=mod0$coef[1],b=mod0$coef[2], col='red', lwd=3)
abline(a=mod1$coef[1],b=mod1$coef[2], col='blue', lwd=3)
legend("topright", legend = c("Linear Model", "Poisson"), col = c("red", "blue"), lwd = 2)
From the above plot, the model that fits the data best is the Poisson GLM.
library(ergm)
am <- get.adjacency(g_w, sparse = FALSE)
g_ergm <-as.network(am, directed = FALSE)
ergm(g_ergm~edges) %>% summary
## Call:
## ergm(formula = g_ergm ~ edges)
##
## Maximum Likelihood Results:
##
## Estimate Std. Error MCMC % z value Pr(>|z|)
## edges -3.46612 0.04596 0 -75.41 <1e-04 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Null Deviance: 22333 on 16110 degrees of freedom
## Residual Deviance: 4374 on 16109 degrees of freedom
##
## AIC: 4376 BIC: 4384 (Smaller is better. MC Std. Err. = 0)
In general, an Exponential Random Graph Model is a statistical model used to analyse patterns and processes in network data; here it estimates the effect of the edges term on tie formation. The estimated coefficient for the edges term is -3.46612: this is the log-odds of observing a tie between any pair of nodes, and its negative value means that ties are relatively unlikely, i.e. the network is sparse, with far fewer edges than the number of possible pairs.
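Since an edges-only ERGM is just the Bernoulli (Erdős–Rényi) model, the coefficient can be converted back to a tie probability; a quick sketch showing that it should simply recover the observed density of the simple, unweighted graph:
# implied tie probability from the edges coefficient (sketch)
plogis(-3.46612)       # ~0.03
# observed density of the simplified graph; should match
edge_density(g_w)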
We now perform some clustering analysis using different methods: the fast greedy algorithm, the edge betweenness clustering algorithm, and the Louvain community detection algorithm. We will also look at the modularity scores, which quantify the quality of the community structure within the network.
g_kc <- graph_from_data_frame(merged_edges, directed=FALSE)
kc <- cluster_fast_greedy(g_kc)
In general, the cluster_fast_greedy() function partitions the graph into communities based on the connectivity patterns between vertices, aiming to find a division of the graph that maximizes the modularity score. The resulting communities represent groups of vertices that are more densely connected within their own group than to vertices in other groups.
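To see which characters end up together, the membership vector can be joined back to the node labels; a small sketch listing, for example, the members of community 1:
# characters assigned to fast-greedy community 1 (sketch)
mem <- membership(kc)
nodes %>% filter(id %in% names(mem)[mem == 1]) %>% pull(label)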
l_kc <- length(kc)
cat('The number of clusters by the Fast Greedy Algorithm are:',l_kc,'\n')
## The number of clusters by the Fast Greedy Algorithm are: 7
mod_kc <- modularity(kc)
cat('The Modularity Coefficient is:',mod_kc,'\n')
## The Modularity Coefficient is: 0.5752339
table(membership(kc))
##
## 1 2 3 4 5 6 7
## 40 56 18 24 19 20 3
We now plot the clustering method based on the Fast Greedy Algorithm:
set.seed(123456)
V(g_kc)$community <- kc$membership
colors <- adjustcolor(col = c("red", "orange", "yellow", "green", "blue", "#4b0082", "violet"), alpha=1)
plot(kc, g_kc, vertex.size=5, vertex.color=colors[V(g_kc)$community], vertex.label=NA, asp=.5, main="Fast Greedy Clustering")
# dendrogram based on kc
dend_col <- c("red", "orange", "yellow", "green", "blue", "#4b0082", "violet")
par(cex=.4)
plot_dendrogram(kc, mode = 'hclust', colbar=dend_col, axes=FALSE)
legend("topright", legend=c(1:length(kc)), col=dend_col, lwd=2)
g_ceb <- graph_from_data_frame(merged_edges, directed=FALSE)
ceb <- cluster_edge_betweenness(g_ceb)
In general, the edge betweenness algorithm works by iteratively removing the edge with the highest betweenness centrality; as edges are removed, the graph splits into communities corresponding to the remaining connected components. This process continues until all edges are removed, resulting in a hierarchical structure of communities.
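Because the algorithm produces a full hierarchy, the reported clustering is the cut that maximises modularity; a small sketch, assuming the returned object exposes the per-merge modularity vector (as hierarchical igraph methods do):
# modularity along the merge sequence of the edge-betweenness hierarchy (sketch)
plot(ceb$modularity, type = "l", xlab = "Merge step", ylab = "Modularity")
which.max(ceb$modularity)   # merge step with the highest modularity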
l_ceb <- length(ceb)
cat('The number of clusters by Edge Betweenness Algorithm are:',l_ceb,'\n')
## The number of clusters by Edge Betweenness Algorithm are: 46
mod_ceb <- modularity(ceb)
cat('The Modularity Coefficient is:',mod_ceb,'\n')
## The Modularity Coefficient is: 0.5086439
table(membership(ceb))
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
## 14 3 8 2 1 4 27 1 2 3 1 1 16 18 18 1 3 4 2 2 1 1 1 5 2 1
## 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
## 1 8 9 2 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1
We now plot the clustering method based on the edge betweenness algorithm:
# Plotting the histogram
hist(membership(ceb), breaks = 30, xlim=c(0,50), xlab= "Clusters", main= "Cluster Membership Distribution", col=hcl.colors(length(b), rev = F, palette= 'blues'), border="black")
set.seed(123456)
V(g_ceb)$community <- ceb$membership
plot(ceb, g_ceb, vertex.size=5, vertex.label=NA, asp=.5, main="Edge Betweenness Clustering")
dendPlot(ceb, cex= .4)
# Louvain Comunity Detection
g_hc <- graph_from_data_frame(merged_edges, directed=FALSE)
cluster <- cluster_louvain(g_hc)
cluster_df <- data.frame(as.list(membership(cluster)))
cluster_df <- as.data.frame(t(cluster_df))
cluster_df$id <- rownames(cluster_df)
# Create group column
nodes <- left_join(nodes, cluster_df, by = "id")
colnames(nodes)[4] <- "group"
In general, the Louvain Algorithm is a widely used and efficient method for community detection in networks. It optimizes the modularity of the network by iteratively moving vertices between communities to increase the modularity score. The algorithm starts with each vertex in its own community and iteratively merges communities to maximize modularity.
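A quick way to quantify how similar the Louvain partition is to the fast-greedy one is igraph’s compare() with normalised mutual information; a small sketch:
# similarity between the Louvain and fast-greedy partitions (sketch)
# NMI = 1 means identical assignments, 0 means unrelated ones
compare(cluster, kc, method = "nmi")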
l_hc <- length(cluster)
cat('The number of clusters by Louvain Algorithm are:',l_hc,'\n')
## The number of clusters by Louvain Algorithm are: 8
mod_hc <- modularity(cluster)
cat('The modularity Coefficient is:',mod_hc,'\n')
## The modularity Coefficient is: 0.5774447
We now plot the clustering method based on the Louvain algorithm:
set.seed(123456)
plot(cluster, g_hc, vertex.size=5, vertex.label=NA, asp=.5, main="Louvain Clustering")
table(membership(cluster))
##
## 1 2 3 4 5 6 7 8
## 24 17 54 3 20 19 35 8
visNetwork(nodes, merged_edges, width = "100%") %>%
visIgraphLayout() %>%
visNodes(
shape = "dot",
color = list(
background = "#0085AF",
border = "#013848",
highlight = "#FF8000"
),
shadow = list(enabled = TRUE, size = 10)
) %>%
visEdges(
shadow = FALSE,
color = list(color = "#0085AF", highlight = "#C62F4B")
) %>%
visOptions(highlightNearest = list(enabled = T, degree = 1, hover = T),
selectedBy = "group") %>%
visLayout(randomSeed = 11)
mod_tab <- cbind(mod_kc, mod_ceb, mod_hc)
mod_tab
## mod_kc mod_ceb mod_hc
## [1,] 0.5752339 0.5086439 0.5774447
Recall that a higher modularity score generally indicates a better
division of the network into communities or clusters. Therefore, looking
at the table above, we can say that mod_hc, which
corresponds to the Louvain Clustering Method has the highest
modularity score, suggesting that it provides a better clustering of the
network compared to mod_kc and mod_ceb.
After the analysis of the full network, we split it into 5 subgraphs based on the chapter divisions (the Label column in the edges dataframe), in order to get a deeper understanding of the network’s actors within each chapter.
grouped_chap <- edges %>%
mutate(Label = str_extract(Label, "^\\d+"))
subgroups <- grouped_chap %>%
group_split(Label)
chp <- vector(mode = "list", length = 5)
for (i in 1:length(subgroups)) {
multiple_edges <- subgroups[[i]] %>%
group_by(from, to) %>%
summarize(n = n())
chp[[i]] <- data.frame()
chp[[i]] <- subgroups[[i]] %>%
left_join(multiple_edges, by = c("from", "to")) %>%
mutate(n = ifelse(is.na(n), 1, n)) %>%
distinct(from, to, .keep_all=TRUE)
}
g_chp <- vector(mode = "list", length = 5)
for (k in 1:length(chp)) {
g_chp[[k]] <- graph_from_data_frame(chp[[k]], directed=FALSE)
cat("Is the graph associated to chapter", k,"simple: ",is.simple(g_chp[[k]]), "\n")
}
## Is the graph associated to chapter 1 simple: TRUE
## Is the graph associated to chapter 2 simple: TRUE
## Is the graph associated to chapter 3 simple: TRUE
## Is the graph associated to chapter 4 simple: TRUE
## Is the graph associated to chapter 5 simple: TRUE
vis_g_c1 <- toVisNetworkData(g_chp[[1]])
visNetwork(
nodes = vis_g_c1$nodes,
edges = vis_g_c1$edges,
width = "100%",
height = '600px'
)
vis_g_c2 <- toVisNetworkData(g_chp[[2]])
visNetwork(
nodes = vis_g_c2$nodes,
edges = vis_g_c2$edges,
width = "100%",
height = '600px'
)
vis_g_c3 <- toVisNetworkData(g_chp[[3]])
visNetwork(
nodes = vis_g_c3$nodes,
edges = vis_g_c3$edges,
width = "100%",
height = '600px'
)
vis_g_c4 <- toVisNetworkData(g_chp[[4]])
visNetwork(
nodes = vis_g_c4$nodes,
edges = vis_g_c4$edges,
width = "100%",
height = '600px'
)
vis_g_c5 <- toVisNetworkData(g_chp[[5]])
visNetwork(
nodes = vis_g_c5$nodes,
edges = vis_g_c5$edges,
width = "100%",
height = '600px'
)
t_c1 <- transitivity(g_chp[[1]])
t_c2 <- transitivity(g_chp[[2]])
t_c3 <- transitivity(g_chp[[3]])
t_c4 <- transitivity(g_chp[[4]])
t_c5 <- transitivity(g_chp[[5]])
trans_tab <- cbind(t_c1, t_c2, t_c3,t_c4,t_c5)
trans_tab
## t_c1 t_c2 t_c3 t_c4 t_c5
## [1,] 0.2284908 0.33 0.5451807 0.4928937 0.5196305
par(mfrow = c(2, 3))
d_c1 <- degree(g_chp[[1]])
d_c2 <- degree(g_chp[[2]])
d_c3 <- degree(g_chp[[3]])
d_c4 <- degree(g_chp[[4]])
d_c5 <- degree(g_chp[[5]])
hist(d_c1, breaks=15, xlab="Degree", main="Degree of Chapter1", col=hcl.colors(length(b), rev = F, palette= 'blues'), border="black")
hist(d_c2, breaks=15, xlab="Degree", main="Degree of Chapter2", col=hcl.colors(length(b), rev = F, palette= 'blues'), border="black")
hist(d_c3, breaks=15, xlab="Degree", main="Degree of Chapter3", col=hcl.colors(length(b), rev = F, palette= 'blues'), border="black")
hist(d_c4, breaks=15, xlab="Degree", main="Degree of Chapter4", col=hcl.colors(length(b), rev = F, palette= 'blues'), border="black")
hist(d_c5, breaks=15, xlab="Degree", main="Degree of Chapter5", col=hcl.colors(length(b), rev = F, palette= 'blues'), border="black")
hist(degree(g_w), xlab="Degree", main="Degree of Full Book", col=hcl.colors(length(b), rev = F, palette= 'blues'), border="black")
### Strength subgraphs
par(mfrow = c(2, 3))
s_c1 <- graph.strength(g_chp[[1]])
s_c2 <- graph.strength(g_chp[[2]])
s_c3 <- graph.strength(g_chp[[3]])
s_c4 <- graph.strength(g_chp[[4]])
s_c5 <- graph.strength(g_chp[[5]])
hist(s_c1, main="Strength of chapter1", xlab="strength", col=hcl.colors(length(b), rev = F, palette= 'blues'), border="black")
hist(s_c2, main="Strength of chapter2", xlab="strength", col=hcl.colors(length(b), rev = F, palette= 'blues'), border="black")
hist(s_c3, main="Strength of chapter3", xlab="strength", col=hcl.colors(length(b), rev = F, palette= 'blues'), border="black")
hist(s_c4, main="Strength of chapter4", xlab="strength", col=hcl.colors(length(b), rev = F, palette= 'blues'), border="black")
hist(s_c5, main="Strength of chapter5", xlab="strength", col=hcl.colors(length(b), rev = F, palette= 'blues'), border="black")
hist(s, main="Strength of Full Book", xlab="strength", col=hcl.colors(length(b), rev = F, palette= 'blues'), border="black")
### Betweenness subgraphs
par(mfrow = c(2, 3))
b_c1 <- betweenness(g_chp[[1]])
b_c2 <- betweenness(g_chp[[2]])
b_c3 <- betweenness(g_chp[[3]])
b_c4 <- betweenness(g_chp[[4]])
b_c5 <- betweenness(g_chp[[5]])
hist(b_c1, main="Betweenness of Chapter1", xlab="Betweenness", breaks=20, col=hcl.colors(length(b), rev = F, palette= 'blues'), border="black")
hist(b_c2, main="Betweenness of Chapter2", xlab="Betweenness", breaks=20, col=hcl.colors(length(b), rev = F, palette= 'blues'), border="black")
hist(b_c3, main="Betweenness of Chapter3", xlab="Betweenness", breaks=10, col=hcl.colors(length(b), rev = F, palette= 'blues'), border="black")
hist(b_c4, main="Betweenness of Chapter4", xlab="Betweenness", breaks=10, col=hcl.colors(length(b), rev = F, palette= 'blues'), border="black")
hist(b_c5, main="Betweenness of Chapter5", xlab="Betweenness", breaks=10, col=hcl.colors(length(b), rev = F, palette= 'blues'), border="black")
hist(b, main="Betweenness of Full Book", xlab="Betweenness", breaks=30, col=hcl.colors(length(b), rev = F, palette= 'blues'), border="black")
par(mfrow = c(2, 3))
plot(s_c1, b_c1, ylab="Betweenness", xlab= "Strength", main="Bet vs Stre in Chapter1", pch=16, col=hcl.colors(length(b), rev = F, palette= 'blues'))
plot(s_c2, b_c2, ylab="Betweenness", xlab= "Strength", main="Bet vs Stre in Chapter2", pch=16, col=hcl.colors(length(b), rev = F, palette= 'blues'))
plot(s_c3, b_c3, ylab="Betweenness", xlab= "Strength", main="Bet vs Stre in Chapter3", pch=16, col=hcl.colors(length(b), rev = F, palette= 'blues'))
plot(s_c4, b_c4, ylab="Betweenness", xlab= "Strength", main="Bet vs Stre in Chapter4", pch=16, col=hcl.colors(length(b), rev = F, palette= 'blues'))
plot(s_c5, b_c5, ylab="Betweenness", xlab= "Strength", main="Bet vs Stre in Chapter5", pch=16, col=hcl.colors(length(b), rev = F, palette= 'blues'))
plot(s,b, ylab="Betweenness", xlab= "Strength", main="Bet vs Stre in Full Book", pch=16, col=hcl.colors(length(b), rev = F, palette= 'blues'))
b_c <- list(b_c1, b_c2, b_c3, b_c4, b_c5)
for (i in 1:length(g_chp)){
ia <- order(b_c[[i]], decreasing=T)[1:2]
name <- names(b_c[[i]][ia])
name_out_1 <- nodes[nodes$id == name[1], ]; name_out_1 <- na.omit(name_out_1)
name_out_2 <- nodes[nodes$id == name[2], ]; name_out_2 <- na.omit(name_out_2)
namename_1 <- name_out_1$label
namename_2 <- name_out_2$label
cat('The important actors in chapter', i,'are:',toupper(namename_1), 'and', toupper(namename_2),'\n')
}
## The important actors in chapter 1 are: JEAN VALJEAN and FANTINE
## The important actors in chapter 2 are: JEAN VALJEAN and COSETTE
## The important actors in chapter 3 are: MARIUS and COLONEL PONTMERCY
## The important actors in chapter 4 are: GAVROCHE and M. MABEUF
## The important actors in chapter 5 are: JEAN VALJEAN and MARIUS
In conclusion, the aim of the project was to analyse the “Les Misérables” dataset, based on Victor Hugo’s masterpiece, from the perspective of a network representation. In this network analysis we looked at descriptive characteristics such as degree, strength, betweenness and transitivity. We then looked for the “important actor” in the network, discovering, as one might expect, that it matched the main character of the book (JEAN VALJEAN). After that we examined the degree distribution of the network, looking for the model that best explained its structure. We then performed a clustering analysis with several different methods, in order to find the best way to divide our network into communities. Lastly, we divided the network not by clustering algorithms but by using the structure of the dataset itself, namely the chapter labels. From this last analysis we found out how the different descriptive characteristics of the network change between chapters and how the characters are distributed across the book.
Link to our project repository: (https://github.com/gvector/LesMiserables.git)